Skip to content

Batch attestation duty scheduling for an epoch#9374

Merged
StefanBratanov merged 11 commits intoConsensys:masterfrom
StefanBratanov:batch_attestation_duty_scheduling
May 14, 2025
Merged

Batch attestation duty scheduling for an epoch#9374
StefanBratanov merged 11 commits intoConsensys:masterfrom
StefanBratanov:batch_attestation_duty_scheduling

Conversation

@StefanBratanov
Copy link
Copy Markdown
Contributor

@StefanBratanov StefanBratanov commented Apr 25, 2025

PR Description

Introduce AttestationDutyDefaultSchedulingStrategy and AttestationDutyBatchSchedulingStrategy

Batching by slots with added delays is used when there are lot of duties to be scheduled (excessive signing at the beginning of an epoch)

Fixed Issue(s)

fixes #9292

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

  • I thought about adding a changelog entry, and added one if I deemed necessary.

@StefanBratanov StefanBratanov marked this pull request as draft April 25, 2025 15:57
Comment on lines +49 to +50
private static final int BATCH_SIZE = 2500;
private static final Duration BATCH_DELAY = Duration.ofSeconds(3);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was thinking more slots...

2500 is a lot (but may be ok if its configurable)

Theres some use cases

  • absolute worst case is bootstrapping and a large validator set. (would include a range sync here)
  • best case is we're already running, because we're just padding out our current set of duties

im not sure if we need to handle these separately, but basically, i think something slot based should solve both scenarios

  • any slots in the 'past' (by time computation) can be excluded
  • current slot needs to be known ASAP (if we're late into that we may as well not sub)
  • assuming current slot is not in our list of duties (this would be if we already have an epoch of duties computed and we're filling in epoch+1), then we can just do 1 slot of signatures and then sleep 1, then do another slot.
  • if we're computing current epoch duties, then we probably need to be doing 3-4 slots in a batch...

so at start, we have 2 epochs of duties to figure out if we're aggregators

  • exclude all 'historic' slots
  • order by slot
  • if we're current epoch, iterate 1 slot at a time for signatures. this gives gaps while we wait for that slot to complete if we're trying to get current duties in. sleep 500ms every period (3-4 slots)
    • so if we have slots 10-32 of our current to consider, then we iterate those in batches, but 1 slot at a time (sleeping 1s every ... say 3rd-4th... slot)
  • from 33-64, we would just do a single slot then sleep 1s. this would spread the sigs over a few slots.

If we have less than about 1k validators we likely don't care for this overhead, but once we get to large numbers like some of our test infra, this would help a lot.

On 32k validators, that looks like on average 1k / slot, so by spacing them out we get space to do things like sign actual duties.

Concretely,
eg1.
slot 32, i have already computed my current epoch (from the previous time) so i get only my future epoch. I have 32k sigs to do so i do 1k (average) sleep 1s, repeat and it takes me 32 seconds to get slots 33-64 aggregation duties.

eg2.
Slot 1, if i haven't computed any duties before, then i have 64 duties (current and next epoch)
i do slot 1 computes straight away, in the hope i'm not too late. then without a rest i do 2, then 3. that gives me all the data i need for the first few slots.
I sleep 500ms, do slots 4,5, then 6. rinse, repeat until i get to 33. (would take... 6 seconds ish of sleeps)
at 33, i do that, sleep 1s. 34, sleep 1s. repeat. (would take over half a minute not including signing time)

eg3.
slot 29, i haven't computed duties before, i get duties from current/next epoch.
I exclude 1-28, as they're historic and i've already missed those aggregating tasks.
I compute for 29. compute 30, 31, sleep 500ms.
then compute 32 which is last of current epoch, so sleep 500ms.
then 33, sleep 1s etc.

I guess theres still the edge case where you have more than 1k sigs per slot but i think at that point we have bigger issues...

This was a big comment, but kind of encapsulates some considerations i guess.

@StefanBratanov StefanBratanov force-pushed the batch_attestation_duty_scheduling branch from dd65c65 to bc5d55a Compare April 28, 2025 15:48
Comment on lines +123 to +124
final boolean isCurrentEpoch =
spec.computeEpochAtSlot(currentSlot.get()).isLessThanOrEqualTo(epoch);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahm... so current epoch should be just equals?

@StefanBratanov StefanBratanov marked this pull request as ready for review April 29, 2025 16:02
// expensive aggregation slot signing in beginning of the epoch when a node is running a large
// number of validators
@VisibleForTesting
record SlotBatchingOptions(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i like that idea

Copy link
Copy Markdown
Contributor

@rolfyone rolfyone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, we should see how it goes. I wonder if we also want to take timings of each batch between sleeps but that could definitely be separate to this PR.

@StefanBratanov
Copy link
Copy Markdown
Contributor Author

@lucassaldanha I am not 100% sure how DvtAttestationAggregations works but it looks like getCombinedSelectionProofFuture does not complete until all other duties reach that stage. In the case of batching, I am not sure how to make this work. Any idea or I am misunderstanding possibly how it works.

@StefanBratanov StefanBratanov force-pushed the batch_attestation_duty_scheduling branch 3 times, most recently from 7698253 to f647aeb Compare May 14, 2025 10:42
@StefanBratanov
Copy link
Copy Markdown
Contributor Author

After:
Slow Block Publishing *** Slot: 418048 start 728ms
Before:
Slow Block Publishing *** Slot: 400512 start 5263ms

@StefanBratanov StefanBratanov force-pushed the batch_attestation_duty_scheduling branch from 08e08a7 to f9359c3 Compare May 14, 2025 14:45
@StefanBratanov StefanBratanov enabled auto-merge (squash) May 14, 2025 14:45
@StefanBratanov StefanBratanov merged commit 03b0df7 into Consensys:master May 14, 2025
16 checks passed
rolfyone pushed a commit to rolfyone/teku that referenced this pull request May 20, 2025
zilm13 added a commit that referenced this pull request May 20, 2025
* Add read functionality to blobs archiving (#9318)

* Schedule Gnosis Pectra hard-fork (#9340)

* Updated ref tests v1.5.0-beta.5 (#9341)

* Created Fulu skeleton (#9325)

* Made checkpoint-state-url and initial-state mutually exclusive (#9342)

It's hard to reason on how these two should interact, and logically they're mutually exclusive, so this just solidifies the theory that they shouldn't be used together.

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* partial 3rd party updates and errorprone updates (#9351)

I ended up disabling the instanceof errorprone, as it's very very common for us to be doing it apparently. I started changing things but in the interest of time will raise a ticket

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Make builder timeouts only for the HTTP call (#9353)

Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net>

* more 3rd party updates (#9355)


Signed-off-by: Paul Harris <paul.harris@consensys.net>

* KZG updates from das branch (#9335)

* Avoid combining validators by aggregating committee when using DVT (#9357)

* builder json format, make media type more compatible (#9360)

* refactored expected withdrawals and added test cases (#9361)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Use `URI.create(...).toURL()` removing URL constructor deprecation (#9364)

* clear-changelog (#9366)

* New infra stream module (#9362)

* feat: Add support for reading static peers from file (#9328)

Co-authored-by: Paul Harris <paul.harris@consensys.net>

* Start cleaning up `api/schema` (#9376)

* Start cleaning up `api/schema`

 - moved `ValidatorStatus`
 - Removed `Version`
 - Moved `PublicKeyException`
 - Moved `EventType`

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* cleanup the rest of api/schema (#9377)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Added KZG computeCells method (#9375)

* Added KZG computeCells method
* Moved KZG no-operation logic into testFixture class

* Fulu Schemas (ssz + datastructures) - including fulu ssz ref tests (#9363)

* Added context to the sync committee duties error (#9380)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Returning custody_group_count on node/identity Beacon API (#9381)

* may 3rd party updates (#9388)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Added snakeyaml to allow more complicated config reading (#9390)

 - added tests to show reading lists of objects (BPO related)
 - added a testcase to show that hex is read correctly
 - added testcase to show specConfigReader isnt filling defaults when we read local config files

 Partially addresses #9365

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Add FULU to the Spec Factory (#9373)

* Improve attestation bits aggregator electra (#9393)

* added a DeserializableConfigTypeDefinition (#9396)

- Tests demonstrate that this should at least be very close to what we require for reading BPO schedule.

 Partially addresses #9365

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Updating ref test to 1.5.0 stable (#9398)

* uniforms the AggregatingAttestationPool interface (#9401)

* Avoids potential mutability of the result of getCommitteeBits (#9402)

* Adding DataColumnsByRootIdentifier container (#9399)

* Added an info message for the highest milestone (#9405)

Also added a sanity check that will fail if the specFactory hasn't defined the highest milestone correctly, which has flow on effects.

fixes #9400

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Introduce `PooledAttestation` and `PooledAttestationWithData` (#9407)

* Improve jdk 24 support and add docker-jdk24 (#9410)

* add docker-jdk24 support

* solves jdk 24 startup warning

* rename AttestationBitsAggregator to AttestationBits (#9412)

* Introduce aggregating pool interface (#9414)

* [docker-jdk24] SIGHUP handler fix (#9413)

* Optionally include validator indices in `PooledAttestation` (#9418)

* Improve committeesSize retrieval in `AggregatingAttestationPool` (#9419)

* Batch attestation duty scheduling for an epoch (#9374)

* Updated MiscHelpersFulu + Added Gossip Logger (#9422)

* Updated MiscHelpersFulu
* Added GossipLogger and DAS related changes

* gossip DAS related changes

* fix test

* fix runtime with noop fulu managers

* bump MAX_SUBSCRIBED_TOPICS for Fulu and further forks

* Use master's circle ci config

* Remove duplicated workflows

* Update test fixtures with random order changes

* fix test

* fix test

* fix GetNewBlockV3Test

* fix GetBlockTest

* Gossip DAS related changes (#9423)

* added bpo parsing to configuration (#9406)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Fix ProduceBlockRequestTest for Fulu and updated order in random

* Regenerate test fixtures with the changes in DataStructureUtil call order

* Regenerate test fixtures with the changes in DataStructureUtil call order for GetNewBlockV3IntegrationTest

* spotless

* move GetDataColumnSidecars to tekuv1 space, fix openapi integration test

* BPO - remove max blobs from fulu spec (#9427)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Introduces `RewardBasedAttestationSorter` (#9428)

* Fix integration tests after master upstream

* Fix property tests + refactor some random datastructure utils to avoid fulu duplication

* fix assemble

* fix test

* revert das test coverage changes

* DAS storage changes (#9432)

* Added config defaulting to the Config loader (#9439)

Signed-off-by: Paul Harris <paul.harris@consensys.net>
Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* Introduce `AttestationMatchingGroupV2` (#9438)

* Fix DasSyncAcceptanceTest, fix DSL, remove broken DSL

---------

Signed-off-by: Paul Harris <paul.harris@consensys.net>
Co-authored-by: Stefan Bratanov <stefan.bratanov93@gmail.com>
Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net>
Co-authored-by: Paul Harris <paul.harris@consensys.net>
Co-authored-by: Enrico Del Fante <enrico.delfante@consensys.net>
Co-authored-by: crStiv <cryptostiv7@gmail.com>
Co-authored-by: Mehdi AOUADI <mehdi.aouadi@consensys.net>
Co-authored-by: Lucas Saldanha <lucascrsaldanha@gmail.com>
Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com>
zilm13 added a commit that referenced this pull request May 21, 2025
* Add read functionality to blobs archiving (#9318)

* Schedule Gnosis Pectra hard-fork (#9340)

* Updated ref tests v1.5.0-beta.5 (#9341)

* Created Fulu skeleton (#9325)

* Made checkpoint-state-url and initial-state mutually exclusive (#9342)

It's hard to reason on how these two should interact, and logically they're mutually exclusive, so this just solidifies the theory that they shouldn't be used together.

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* partial 3rd party updates and errorprone updates (#9351)

I ended up disabling the instanceof errorprone, as it's very very common for us to be doing it apparently. I started changing things but in the interest of time will raise a ticket

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Make builder timeouts only for the HTTP call (#9353)

Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net>

* more 3rd party updates (#9355)


Signed-off-by: Paul Harris <paul.harris@consensys.net>

* KZG updates from das branch (#9335)

* Avoid combining validators by aggregating committee when using DVT (#9357)

* builder json format, make media type more compatible (#9360)

* refactored expected withdrawals and added test cases (#9361)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Use `URI.create(...).toURL()` removing URL constructor deprecation (#9364)

* clear-changelog (#9366)

* New infra stream module (#9362)

* feat: Add support for reading static peers from file (#9328)

Co-authored-by: Paul Harris <paul.harris@consensys.net>

* Start cleaning up `api/schema` (#9376)

* Start cleaning up `api/schema`

 - moved `ValidatorStatus`
 - Removed `Version`
 - Moved `PublicKeyException`
 - Moved `EventType`

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* cleanup the rest of api/schema (#9377)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Added KZG computeCells method (#9375)

* Added KZG computeCells method
* Moved KZG no-operation logic into testFixture class

* Fulu Schemas (ssz + datastructures) - including fulu ssz ref tests (#9363)

* Added context to the sync committee duties error (#9380)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Returning custody_group_count on node/identity Beacon API (#9381)

* may 3rd party updates (#9388)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Added snakeyaml to allow more complicated config reading (#9390)

 - added tests to show reading lists of objects (BPO related)
 - added a testcase to show that hex is read correctly
 - added testcase to show specConfigReader isnt filling defaults when we read local config files

 Partially addresses #9365

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Add FULU to the Spec Factory (#9373)

* Improve attestation bits aggregator electra (#9393)

* added a DeserializableConfigTypeDefinition (#9396)

- Tests demonstrate that this should at least be very close to what we require for reading BPO schedule.

 Partially addresses #9365

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Updating ref test to 1.5.0 stable (#9398)

* uniforms the AggregatingAttestationPool interface (#9401)

* Avoids potential mutability of the result of getCommitteeBits (#9402)

* Adding DataColumnsByRootIdentifier container (#9399)

* Added an info message for the highest milestone (#9405)

Also added a sanity check that will fail if the specFactory hasn't defined the highest milestone correctly, which has flow on effects.

fixes #9400

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Introduce `PooledAttestation` and `PooledAttestationWithData` (#9407)

* Improve jdk 24 support and add docker-jdk24 (#9410)

* add docker-jdk24 support

* solves jdk 24 startup warning

* rename AttestationBitsAggregator to AttestationBits (#9412)

* Introduce aggregating pool interface (#9414)

* [docker-jdk24] SIGHUP handler fix (#9413)

* Optionally include validator indices in `PooledAttestation` (#9418)

* Improve committeesSize retrieval in `AggregatingAttestationPool` (#9419)

* Batch attestation duty scheduling for an epoch (#9374)

* Updated MiscHelpersFulu + Added Gossip Logger (#9422)

* Updated MiscHelpersFulu
* Added GossipLogger and DAS related changes

* Gossip DAS related changes (#9423)

* added bpo parsing to configuration (#9406)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* BPO - remove max blobs from fulu spec (#9427)

Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Introduces `RewardBasedAttestationSorter` (#9428)

* DAS storage changes (#9432)

* Added config defaulting to the Config loader (#9439)

Signed-off-by: Paul Harris <paul.harris@consensys.net>
Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* Introduce `AttestationMatchingGroupV2` (#9438)

* Introduce `AggregatingAttestationPoolV2` (#9445)

* Introduce `AggregatingAttestationPoolProfiler` implementations (#9447)

Co-authored-by: Paul Harris <paul.harris@consensys.net>

* PeerDAS RPC related changes (#9444)

* Added a development flag to enable strict config loading (#9449)

Strict loading was the default until yesterday, but it's caused a number of problems because if we push changes too fast etc it causes support issues.

The new default permissive loading would mean that if a passed config file is missing an entry, we default it and print a warning.

This new flag `--Xstartup-strict-config-loader-enabled` defaults to false, but if turned on will revert to failing on startup if a config being loaded doesn't include all of the expected configuration attributes.

We could potentially start using this in AT to specify minimal changes to a configuration for a test, but for now I've left acceptance tests using strict loading.


Signed-off-by: Paul Harris <paul.harris@consensys.net>

* Updating reference tests to 1.6.0-alpha.0 (#9450)

* Updating reference tests to 1.6.0-alpha.0

* Added Fulu networking tests

* rename

* Refactoring networking ATs

* Add peer via api (#9431)

* add initial impl of endpoint

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* add catch to cover address conversion failure

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* add tests for 200, 400 and 500

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* spotless

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* add test for when discovery is not enabled

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* spotless

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* fix name of test after changing exception returned for invalid peer

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* add json def to paths

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* add changelog

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* PR review

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* spotless caught me again

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* change impl and tests to handle a single peer per api call

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* spotless

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

---------

Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>

* `AggregatingAttestationPool` new CLI params (#9452)

* poolV2-CLI

* fix arity

* getBits method names clearer (#9453)

* Validate electra attestationData index received by BN (#9430)

* fix integration test

---------

Signed-off-by: Paul Harris <paul.harris@consensys.net>
Signed-off-by: Gabriel Fukushima <gabrielfukushima@gmail.com>
Co-authored-by: Stefan Bratanov <stefan.bratanov93@gmail.com>
Co-authored-by: Lucas Saldanha <lucas.saldanha@consensys.net>
Co-authored-by: Paul Harris <paul.harris@consensys.net>
Co-authored-by: Enrico Del Fante <enrico.delfante@consensys.net>
Co-authored-by: crStiv <cryptostiv7@gmail.com>
Co-authored-by: Mehdi AOUADI <mehdi.aouadi@consensys.net>
Co-authored-by: Gabriel Fukushima <gabrielfukushima@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Be smarter in scheduling aggregation duty

2 participants